Discriminative Speech Recognition Rescoring with Pre-trained Language Models
Second-pass rescoring is a critical component of competitive automatic speech
recognition (ASR) systems. Large language models have demonstrated their
ability to use pre-trained information to better rescore ASR
hypotheses. Discriminative training, which directly optimizes the minimum
word error rate (MWER) criterion, typically improves rescoring. In this study,
we propose and explore several discriminative fine-tuning schemes for
pre-trained LMs. We propose two architectures based on different pooling
strategies of output embeddings and compare them with probability-based MWER. We
conduct detailed comparisons between pre-trained causal and bidirectional LMs
in discriminative settings. Experiments on LibriSpeech demonstrate that all
MWER training schemes are beneficial, giving additional gains of up to 8.5% WER.
Proposed pooling variants achieve lower latency while retaining most
improvements. Finally, our study concludes that bidirectionality is better
utilized with discriminative training.
Comment: ASRU 202
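The MWER criterion referred to above can be sketched in a few lines. The following is an illustrative Python implementation (not any paper's actual code), using a common variance-reduced form that weights each hypothesis's word errors, relative to the N-best average, by its softmax posterior:

```python
import math

def mwer_loss(scores, word_errors):
    """Expected word-error objective over an N-best list (illustrative sketch).

    scores: model scores for each hypothesis (higher = more likely).
    word_errors: edit distance of each hypothesis to the reference transcript.
    """
    # Softmax over hypothesis scores gives a posterior per hypothesis.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    posteriors = [e / total for e in exps]
    # Subtracting the mean error is a standard variance-reduction trick:
    # the loss is negative when probability mass sits on low-error hypotheses.
    mean_err = sum(word_errors) / len(word_errors)
    return sum(p * (e - mean_err) for p, e in zip(posteriors, word_errors))
```

Minimizing this quantity pushes probability mass toward hypotheses with fewer word errors, which is what "directly optimizing the MWER criterion" refers to.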
Scaling Laws for Discriminative Speech Recognition Rescoring Models
Recent studies have found that model performance has a smooth power-law
relationship, or scaling laws, with training data and model size, for a wide
range of problems. These scaling laws allow one to choose nearly optimal data
and model sizes. We study whether this scaling property is also applicable to
second-pass rescoring, which is an important component of speech recognition
systems. We focus on RescoreBERT as the rescoring model, which uses a
pre-trained Transformer-based architecture fine-tuned with an ASR
discriminative loss. Using such a rescoring model, we show that the word error
rate (WER) follows a scaling law for over two orders of magnitude as training
data and model size increase. In addition, we find that a pre-trained model
requires less data than a randomly initialized model of the same size to reach
the same WER, a difference representing effective data transferred from the
pre-training step. This effective data transfer is also found to follow a
scaling law with data and model size.
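A power-law relationship of this kind, WER ≈ a·D^(−b), can be recovered by a linear fit in log-log space. The sketch below is illustrative, not the paper's fitting procedure:

```python
import math

def fit_power_law(sizes, wers):
    """Fit wer ~ a * size**(-b) by least squares in log-log space.

    Illustrative sketch: taking logs turns the power law into a line,
    log(wer) = log(a) - b * log(size), fit by ordinary least squares.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(w) for w in wers]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    b = -slope                      # power-law exponent
    a = math.exp(my - slope * mx)   # prefactor
    return a, b
```

Given such a fit for several model sizes, one can extrapolate the data budget needed to reach a target WER, which is how scaling laws guide nearly optimal data and model size choices.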
Personalization for BERT-based Discriminative Speech Recognition Rescoring
Recognition of personalized content remains a challenge in end-to-end speech
recognition. We explore three novel approaches that use personalized content in
a neural rescoring step to improve recognition: gazetteers, prompting, and a
cross-attention based encoder-decoder model. We use internal de-identified
en-US data from interactions with a virtual voice assistant supplemented with
personalized named entities to compare these approaches. On a test set with
personalized named entities, we show that each of these approaches improves
word error rate by over 10% relative to a neural rescoring baseline. We also show
that on this test set, natural language prompts can improve word error rate by
7% without any training and with a marginal loss in generalization. Overall,
gazetteers were found to perform the best with a 10% improvement in word error
rate (WER), while also improving WER on a general test set by 1%.
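Of the three approaches, gazetteer biasing is the simplest to sketch: boost the rescoring score of any hypothesis that mentions a personalized entity. The function and bonus value below are hypothetical illustrations, not the paper's implementation:

```python
def gazetteer_rescore(hypotheses, scores, gazetteer, bonus=1.0):
    """Add a score bonus per personalized entity matched in each hypothesis.

    hypotheses: candidate transcripts from the first pass.
    scores: their rescoring-model scores (higher = better).
    gazetteer: set of personalized entity strings (e.g. contact names).
    Hypothetical sketch of gazetteer biasing in a neural rescoring step.
    """
    rescored = []
    for hyp, score in zip(hypotheses, scores):
        matches = sum(1 for entity in gazetteer if entity in hyp)
        rescored.append(score + bonus * matches)
    return rescored
```

With a contact list containing "john smith", a hypothesis "call john smith" would be boosted past an otherwise higher-scoring "call jon smith", recovering the personalized entity.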
Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting
We explore the ability of large language models (LLMs) to act as speech
recognition post-processors that perform rescoring and error correction. Our
first focus is on instruction prompting to let LLMs perform these tasks without
fine-tuning, for which we evaluate different prompting schemes, both zero- and
few-shot in-context learning, and a novel task-activating prompting method that
combines causal instructions and demonstrations to enrich the context window.
Next, we show that rescoring only by in-context learning with frozen LLMs
achieves results that are competitive with rescoring by domain-tuned LMs, using
a pretrained first-pass recognition system and rescoring output on two
out-of-domain tasks (ATIS and WSJ). By combining prompting techniques with
fine-tuning we achieve error rates below the N-best oracle level, showcasing
the generalization power of the LLMs.
Comment: Accepted to IEEE Automatic Speech Recognition and Understanding (ASRU) 2023. 8 pages.
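A zero- or few-shot prompt for this kind of error correction can be assembled mechanically from in-context examples and the first-pass N-best list. The instruction wording below is invented for illustration; the actual prompts used in the paper may differ:

```python
def build_correction_prompt(examples, nbest):
    """Assemble a few-shot prompt asking an LLM to correct an ASR N-best list.

    examples: list of (candidate_transcripts, reference) demonstration pairs.
    nbest: candidate transcripts for the utterance to be corrected.
    The instruction and field labels are hypothetical, for illustration only.
    """
    lines = ["Correct the speech recognition output given the candidate transcripts."]
    for candidates, reference in examples:
        lines.append("Candidates: " + " | ".join(candidates))
        lines.append("Corrected: " + reference)
    # The final block is left open for the LLM to complete.
    lines.append("Candidates: " + " | ".join(nbest))
    lines.append("Corrected:")
    return "\n".join(lines)
```

Zero-shot corresponds to an empty `examples` list; few-shot in-context learning adds demonstrations, and the frozen LLM's completion serves as the corrected transcript.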
PROCTER: PROnunciation-aware ConTextual adaptER for personalized speech recognition in neural transducers
End-to-End (E2E) automatic speech recognition (ASR) systems used in voice
assistants often have difficulties recognizing infrequent words personalized to
the user, such as names and places. Rare words often have non-trivial
pronunciations, and in such cases, human knowledge in the form of a
pronunciation lexicon can be useful. We propose a PROnunciation-aware
ConTextual adaptER (PROCTER) that dynamically injects lexicon knowledge into an
RNN-T model by adding a phonemic embedding along with a textual embedding. The
experimental results show that the proposed PROCTER architecture outperforms
the baseline RNN-T model by improving the word error rate (WER) by 44% and 57%
when measured on personalized entities and personalized rare entities,
respectively, while increasing the model size (number of trainable parameters)
by only 1%. Furthermore, when evaluated in a zero-shot setting to recognize
personalized device names, we observe 7% WER improvement with PROCTER, as
compared to only 1% WER improvement with text-only contextual attention.
Comment: To appear in Proc. IEEE ICASS
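The core idea, injecting a phonemic embedding alongside a textual embedding for each contextual entity, can be sketched as a simple fusion step. A weighted elementwise sum is a deliberate simplification; PROCTER's actual adapter architecture inside the RNN-T differs:

```python
def fuse_context_embedding(text_emb, phone_emb, w_text=0.5, w_phone=0.5):
    """Combine a textual and a phonemic embedding for one contextual entity.

    text_emb: embedding of the entity's written form.
    phone_emb: embedding of its lexicon pronunciation.
    The weighted sum and the 0.5/0.5 weights are illustrative assumptions.
    """
    assert len(text_emb) == len(phone_emb), "embeddings must share a dimension"
    return [w_text * t + w_phone * p for t, p in zip(text_emb, phone_emb)]
```

The fused vector is what the contextual adapter would attend over, letting a rare name with a non-trivial pronunciation match the audio through its phonemic representation rather than its spelling alone.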
Low-rank Adaptation of Large Language Model Rescoring for Parameter-Efficient Speech Recognition
We propose a neural language modeling system based on low-rank adaptation
(LoRA) for speech recognition output rescoring. Although pretrained language
models (LMs) like BERT have shown superior performance in second-pass
rescoring, the high computational cost of scaling up the pretraining stage and
of adapting the pretrained models to specific domains limits their practical use in
rescoring. Here we present a method based on low-rank decomposition to train a
rescoring BERT model and adapt it to new domains using only a fraction (0.08%)
of the pretrained parameters. The inserted low-rank matrices are optimized through a
discriminative training objective along with a correlation-based regularization
loss. The proposed low-rank adaptation Rescore-BERT (LoRB) architecture is
evaluated on LibriSpeech and internal datasets, reducing training time by
factors of 3.6 to 5.4.
Comment: Accepted to IEEE ASRU 2023. 8 pages.
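Low-rank adaptation leaves the pretrained weight W frozen and trains only two small factors B and A, so the effective weight becomes W + α·BA. A dependency-free sketch (dimensions, argument names, and α are illustrative):

```python
def matmul(X, Y):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_weight(W, A, B, alpha=1.0):
    """Effective weight W + alpha * (B @ A) under low-rank adaptation.

    W: frozen pretrained weight, shape (d_out, d_in).
    A: trainable factor, shape (r, d_in); B: trainable factor, shape (d_out, r),
    with rank r much smaller than d_in. Illustrative sketch only.
    """
    delta = matmul(B, A)
    return [[w + alpha * d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, delta)]
```

For a d-by-d weight adapted at rank r, the trainable parameter count is 2·d·r versus d², a fraction of 2r/d, which is how budgets as small as 0.08% of the pretrained parameters arise.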